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Abstract 

Linkage maps enable the study of important biological questions. The construction of high-density linkage maps appears 
more feasible since the advent of next-generation sequencing (NGS), which eases SNP discovery and high-throughput 
genotyping of large population. However, the marker number explosion and genotyping errors from NGS data challenge 
the computational efficiency and linkage map quality of linkage study methods. Here we report the HighMap method for 
constructing high-density linkage maps from NGS data. HighMap employs an iterative ordering and error correction 
strategy based on a k-nearest neighbor algorithm and a Monte Carlo multipoint maximum likelihood algorithm. Simulation 
study shows HighMap can create a linkage map with three times as many markers as ordering-only methods while offering 
more accurate marker orders and stable genetic distances. Using HighMap, we constructed a common carp linkage map 
with 10,004 markers. The singleton rate was less than one-ninth of that generated by JoinMap4.1. Its total map distance was 
5,908 cM, consistent with reports on low-density maps. HighMap is an efficient method for constructing high-density, high- 
quality linkage maps from high-throughput population NGS data. It will facilitate genome assembling, comparative 
genomic analysis, and QTL studies. HighMap is available at http://highmap.biomarker.com.cn/. 
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Introduction 

Linkage maps, especially liigh-density ones, play an important 
role in the study of genetics and genomics. Application of high- 
density linkage maps has greatly facilitated discovery of functional 
genes [1], genome assembly [2-6], and comparative analysis of 
genome structure [7-9] . However, most current maps harbor only 
about hundreds of markers, largely plagued by marker discovery 
technologies and genotyping costs. The advent of next-generation 
sequencing (NGS) makes it possible to rapidly discover huge 
numbers of markers. The genotyping approaches based on NGS, 
such as SLAF-seq (specific-locus amplified fragment sequencing) 
[10], RAD (restriction site associated DNA) genotyping [11], and 
genotyping-by-sequencing [12] are even capable of discovering 
and genotyping hundreds of thousands of genetic markers 
throughout the genome at relatively low cost [13]. These 
revolutionary advances in genotyping technologies provide excit- 
ing opportunities to economically construct increasingly dense 
maps [10,14,15]. However, NGS data still inevitably suffer from 
genotyping errors [16-18], especially when sequencing depths are 
low [19-21] and genotypes are highly heterozygous. The inherent 



features of NGS data impose two major challenges on die 
construction of high-density linkage map: First, genotyping errors 
affect the map quality [22]. Second, the marker density explosion 
leads to the exponential increase in computational intensity [22]. 

Great efforts have been made to study algorithms for 
constructing high-density and high-quality linkage map [22-24]. 
RECORD has been developed to produce accurate marker orders 
in a relatively short time by employing the total number of 
observable recombination events between adjacent markers as a 
target function [24]. SMOOTH has been reported to eliminate 
genotyping errors from genetic linkage data during the mapping 
process and improve map quality [22]. However, neither 
RECORD nor SMOOTH is capable of handling populations 
with high heterozygous loci. OneMap [25] and FsLinkageMap 
[26] have been developed to construct linkage maps of high 
heterozygous species. However, OneMap is computationally 
intensive and FsLinkageMap is incapable of constructing high- 
density linkage map. JoinMap4.1 employs a Monte Carlo 
multipoint maximum likelihood algorithm and greatly expedites 
computational speed in marker ordering [27]; nonetheless, it still 
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suffers from the limit of the marker number in hnkage grouping 
[28], and serious expansion of map distance. The problems caused 
by genotyping errors and density explosion still remain great 
challenges for constructing high-density linkage map efficiendy 
and accurately. 

Several practical strategies have been used to tackle the 
difficulties in constructing high-density linkage map in species 
such as sunflower [29], mouse [7], porcine [30], Brassica napus 
[31], maize [32], spotted gar [28] and potato [33]. Sunflower 
Unkage map integrated four individual linkage maps [29] to 
improve marker densities. The integration strategy is laborious 
and quality suspicious. The hnkage map of mouse and pig were 
constructed by direcdy using the physical order of marker in the 
genomes to circumvent the intensive computation of marker 
ordering [7,30]. This strategy only works for the construction of 
species which have genome reference sequence. A bin strategy has 
been used to construct the linkage map of potato [33], Brassica 
napus, maize [32] and spotted gar [28]. A "bin" is a group of 
markers with a unique segregation pattern and is separated from 
adjacent bins by a single recombination event. The bin strategy 
reduces computational costs as well as impacts of genotyping 
errors, but at the cost of incomplete utilization of genotyping data 
and recombination information, reducing the application value of 
high-density Knkage map. All the above linkage maps enabled the 
biology studies in these species, but the methods of map 
construction stiU suffered from computational time, map quality 
and the utilization of genotyping data. 

Here, by using an iterative ordering and error correction 
strategy, we present an efiicient method that simplifles and 
enhances the construction of high-density, high-quality linkage 
map from high-throughput population NGS data (HighMap). Our 
studies reveal that HighMap has excellent performance of high- 
density linkage map construction. HighMap provides an impor- 
tant tool for understanding genetics and genomics. 

Material and Methods 

All experimental procedures were conducted in conformity with 
institutional guidelines for the care and use of laboratory animals 

in Centre for Applied Aquatic Genomics of the Chinese Academy 
of Fishery Sciences. The protocol was approved by the Committee 
on the Ethics of Animal Experiments of the Centre for Apphed 
Aquatic Genomics at Chinese Academy of Fishery Sciences 
(2011AA1004020012). 

HighMap overview 

Here we report a new strategy, the iterative ordering and error 
correction, to construct high-density genetic maps. We referred to 
the error correction strategy of SMOOTH [22], and used a k- 
nearest neighbor algorithm to correct genotyping errors and 
impute genotyping missing [34]. We employed the enhanced 
algorithm of Gibbs sampling, spatial sampling and simulated 
annealing (GSS) [27,35] to order markers. GSS marker ordering 
algorithm is computationally efiicient [27], but it generates inflated 
map distances, and has unstable map quality, especially for the 
data high in genotyping errors. To ensure stability of map quality, 
we enhanced GSS by using the summation of adjacent recombi- 
nation fractions (SARF) as objective function and adopted Blocked 
Gibbs sampler after trying different Gibbs sampling methods and 
different objective functions in simulated annealing. HighMap 
consists of four modules, designed for hnkage grouping, marker 
ordering, error genotyping correction and map evaluation, 
respectively (Figure 1). The map evaluation module provides heat 
mapsand haplotype maps for intuitive displays of map quality [36]. 



Linkage grouping 

The grouping module uses the single-linkage clustering algo- 
rithm to cluster the markers into linkage groups, using a pair-wise 
modified independence LOD score as distance metric. Assuming a 
loci pair with segregation type, a\b\xc\d\ and 02^2 x C2<^2, a 
contingency table of genotypes is produced: 
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Here R, C, T, and O rcpresmit row-total, column-total, grand- 
total, and observed number of each genotype, respectively. The 
expected number E in each cell in above contingency table is 
calculated by E = R* C/T. The independence test G statistic, 
with degrees of freedom d being equal to the number of rows 
minus 1 multiplied by the number of columns minus 1 , is given by 
G = ^2J2 [O* In (0/E)]. The modified LOD score is obtained 
from an approximate transformation: 

r^n_ [(4-e2g^)e2(^-3](c/-l)-l-G^ 



With increasing LOD thresholds, different group nodes form, 
and these can be represented as a dendrogram, which branches 
might be huge and deep. Since the algorithm exhaustively searches 
all possible solutions that the linkage groups number is equal to the 
chromosome number, it is computationally extensive, especially 
for marker data of low depth. To speed up the calculation, we 
simplified the tree structure by adopting two new strategies: 
deleting small "group fragments", and merging small descendant 
nodes into large ancestral nodes. 

IVlarl<er ordering and genotyping error correction 

The mapping algorithm applies an iterative process of marker 
ordering and error genotype correction to ensure the accuracy of 
map order and map distances in the presence of missing 
observations and genotyping errors (Figure 1). Prior to iterative 
mapping, recombinant frequencies and LOD scores were calcu- 
lated by two-point analysis. Based on recombinant fractions and 
LOD values, hnkage phases were inferred by using the anchoring 
algorithm [37]. 

Consider a linkage group consisting of n markers, denoted as 
G = {M,|i = 1,2, ...,«}. The recombination frequency matrix is as 
follows: R = {rM,,Mj\iJ=^,2,...,n;i^j}. The marker ordering 
module introduces a combination of statistic techniques, spatial 
sampling, Gibbs sampling and simulated anneahng, to obtain the 
map order with the summation of recombination frequencies and 
estimate' their mutual genetic distanc(;s [27,35]. 

In the first stage' of tlu' ordc'ring procedure, markers are selected 
using spatial sampling as following. One marker is taken at 
random (marker 5i) in a priority order of fuU cross, F2 cross and 
test cross. Markers with a recombination frequency smaller than a 
given sampling radius r are excluded from the marker set. From 
the remaining markers, one marker in the abovementioned 
priority order is selected again (marker 1S2). AU markers with 
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Figure 1. Modules of HighMap algorithm. A: The single-linkage clustering algorithm was used to partition the marker loci into linkage groups 
based on a pain/vise modified independence LOD score for the recombination frequency. B and B': The ordering module combines Gibbs sampling, 
spatial sampling, and simulated annealing algorithm to order markers and estimate map distances. C: The error correction module identified 
singletons according to parental contribution of genotypes and eliminated them from the data using /(-nearest neighbor algorithm. To order markers 
correctly, the processes of ordering and error correction were carried out iteratively. D: Heat maps and haplotype maps were constructed to evaluate 
map quality. 

doi:1 0.1 371 /journal.pone.0098855.g001 



recombination frequencies smaller than the given value r with 5*2 
are deleted from the remaining marker set. The sampling process 
is continued until no markers are left. Finally, a subset of markers 
{Si,S2,---,S„j) is obtained. In this marker set, all pair-wise 
recombination frequencies rSi.S) are greater than the given value r. 

Subsequently, simulated annealing is employed to find the best 
map order. We adopted SARFsls an objective function for it could 
enormously reduce the computations yet show a lower rate of 
convergence than the maximum likeliliood. Calculation of SARF 
for a given sequence of the above sampled loci is performed by 
summation oi SARF m two parent maps, i.e., 



'="lPl j="\P2 

SARF= ^ fM,_i,M,+ ^ rMj_^,Mj 
i=\ j=i 

Where M\ is the i th element of the paternal maps, Mj is the j th 
element of the parental maps, and «ip, and n\p^ are the number of 
markers in Pi and P2 maps respectively. Markers with full cross 
and F2 cross segregation pattern occur in both maps, whereas 
markers with test cross can be observed for that parent only. 
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simulated annealing starts from an initial map order from which 
the sampled markers are permutated randomly. A new neighbor 
order is obtained by placing a random locus into a random 
position, and will be accepted if 



where ASARF = SARF„g„ — SARFgu, T is the acceptance 
control parameter(7'>0) and t/ is a random value in the range 
[0,1]- The annealing scheme of T is the same as that illustrated by 
Jansen and colleagues [35]. The annealing system stops if, in a 
number of successive steps, the newly generated map order is 
rejected. Once the optimal map order of the sample markers is 
obtained, the new order can be used to estimate multipoint 
recombination frequencies of both parents using Blocked Gibbs 
sampling, for basic Gibbs sampler often did not work when high- 
density maps were constructed using small populations or data rich 
in erroneous markers. The updated recombination frequencies 
help to integrate the two parental maps, determine the order of 
test cross markers and optimize the map order in the next cycle of 
simulated annealing. After three to four cycles, an optimal map 
of sampled markers is obtained. In the next map-building round, 
the sampUng radius decreases, and a subset of currently unmapped 
markers is selected and addc-d to the prc\'ious sample. The 
mapping algorithm repeats the previously described stages for the 
new sample. Then the entire system stops when all markers are in 
place. 

In high-density genetic maps, a genotyping error usually 
manifests itself as a singleton. A singleton is a single locus in one 
offspring which is different in parental origin from both its direcdy 
neighboring loci [22]. Singletons are mainly caused by erroneous 
genotypes, but they may also result from other biological 
phenomena, such as double recombination events, gene conver- 
sions, mutations [22]. In the third module, missing genot\'pes were 
identified according to parental contribution of genot)'pes. 
Identified singletons were eliminated from data and identified 
missing genotypes were imputed using /c-nearest neighbor 
algorithm [22]. Here A: is a parameter that can be specified as 
appropriate. The number of singletons in each marker reflects 
marker quality. Markers with singleton ratios exceeding a given 
threshold are labeled as "suspicious markers". To avoid that 
correct markers are deleted by mistake, "suspicious markers" were 
neither corrected nor imputed. In practice, three to four rounds of 
ordering and error correction are required to produce a 
reasonably accurate map order and map distance. 

Simulation study 

Simulation data sets were randomly generated based on a fuU- 
sib family of an outbreeding species using a Perl script. First, 
markers were randomly placed along a single paternal or maternal 
chromosome at random intervals. Then, offspring's genotypes 
were generated using the simulation data of maternal and paternal 
chromosomes. Assuming that no crossover interferences occurred, 
the number of crossover events solely depended on the distance as 
specified by the simulated positions of the loci on the parent 
chromosomes. Missing and erroneous data were independendy 
and randomly distributed along chromosomes. 

Experimental data 

HighMap performance was further confirmed using the 
sequencing data of a real full-sib family of common carp which 
consisted of 211 offsprings [10] (The sequencing data is available 
at http://highmap.biomarker.com.cn/). JoinMap4.1 was incapa- 



ble of grouping large data of markers. To compare ordering 
performance of HighMap with that of JoinMap4. 1 , we used 
HighMap rather than JoinMap4. 1 to cluster the marker data when 
we constructed the linkage map using JoinMap4.1. 

Results 

Enhancing utilization of NGS data 

Depth and quality of sequencing reads fluctuate randomly 
across genomes due to sampling randomness. To ensure genotype 
quahty, reads with low depth of sequencing should be discarded in 
the process of genotype calling [10]. If linkage map software can 
bear more genotype missing and error, it will be possible to make 
use of more sequencing data and to create higher-density linkage 
maps at lower costs. Therefore, an important consideration in this 
study is NGS data utilization, which reflects the performance of a 
linkage study method in NGS era. To assess data utilization of 
HighMap, simulation data were generated from a full-sib family 
consisting of 200 offsprings. To simulate real NGS data accurately, 
missing observations and genotyping errors were introduced 
incrementally and simultaneously as the marker number increased 
from 100 to 1,000 (Figure SI). The data set was produced by 
iteratively appending 100 markers, each time with an increment of 
5% missing observation and 5% erroneous genotyping. 

Comparative analysis revealed that HighMap permitted the 
utilization of more markers than JoinMap4. 1 (Figure 2 A, 2B and 
2C). HighMap could make use of 700 markers and create linkage 
maps with a Spearman rank order correlation coefficient greater 
than 0.9. In contrast, 300 markers led to the correlation coefficient 
smaller than 0.8 when the linkage map was constructed using 
JoinMap4.1. Based on a cutoff value of 0.8 [24], we estimated that 
HighMap could construct a linkage map with three times the 
number of markers as JoinMap4. 1 could. 

HighMap contains an error correction algorithm, which can 
impute missing observations and eliminate erroneous genotyping 
from mapping data in the mapping process. Data showed that the 
algorithm is efficient. Take the dataset of 700 markers as an 
example. We introduced 10.25% of erroneous data and 12.86% of 
missing observations (Table 1). After error correction treatment, 
final erroneous and missing rate reduced to 5.40% and 0.65%, 
respectively. 94.88% of genotyping errors were accurately 
corrected, and 89.13% of missing observations were accurately 
imputed. The performance of HighMap in genotyping error 
correction and missing observation imputation accounts for its 
high data utilization. 

A singleton is a single locus in one offspring which is different in 
parental origin from both its direct neighboring loci [22]. It 
comprehensively reflects linkage map quahty and is useful for 
quality evaluation of linkage maps since true order of marker is 
unavailable for a linkage map obtained from real data. When 
linkage maps were constructed using 700 marker data, the 
singleton rate of HighMap, 0.77%, was much smaller than that 
generated byJoinMap4. 1 , which was 8.54'% (Figure 2D, 2E and 
2F). This result demonstrates that the correction procedure of 
HighMap is effective and efficient, which ensures that HighMap 
with stands high rate of genotyping errors and make more use of 
marker data. 

Marker order accuracy and map distance stability 

To assess the performance of HighMap in marker order 
accuracy and map distance stability, simulation data set was 
generated from the full-sib family consisting of 200 offsprings with 
200 markers, which contained different missing observations or 
different genotyping errors. Results showed that the Spearman 
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Figure 2. NGS data utilization enhancement by HighMap. The X-axis represents marl<er numbers. Tine Y-axis represents Spearman ranl< 
correlation coefficient between estimated map marker order and true marker location for A, B and C, singleton rates for D, E and F, estimated genetic 
map distances for G, H and I, respectively. 
doi:1 0.1 371 /journal.pone.0098855.g002 



correlation coefficient between the true and calculated marker 
order based on HighMap decreased less obviously than that of 
JoinMap4. 1 as the marker error rate increased. The difierences of 
the correlation coefficient between HighMap and JoinMap4.1 
were more pronounced when error rate exceeded 20% (Figure 3). 
This result demonstrated that HighMap could offer linkage maps 
of higher quality than JoinMap4. 1 when there were a large 
proportion of erroneous markers. The singleton rate of HighMap 
grew slowly as error rates increased, whereas the singleton rate 
ascended linearly with JoinMap4.1. HighMap led to only 3.3% of 
the singleton rate when the marker data contained 20% error, 
whereas JoinMap4.1 led to 14.4% of the singleton rate, suggesting 
that HighMap detected and eliminated most genotyping errors 
from the data. Both the correlation and singleton analysis revealed 
that JoinMap4.0 was sensitive not only to erroneous data but also 
to missing data (Figure 3 and Figure S2). It failed to construct 
linkage map due to its inefficiency in estimating linkage phases 
when the error rate exceeded about 14% (Figure 3). Collectively, 
HighMap remarkedly outperformed both JoinMap4.0 and Join- 
Map4. 1 with respect to marker order accuracy. 

Map distance expansion is mainly caused by genotyping errors 
and the map distance reflects the quality of a linkage map. In the 
presence of genotyping errors, it may be necessary to make a 
balance between controlling the expansion of map distance and 
ensuring validity of the marker order. We gave the priority to 
address issue of marker order accuracy for it is more important 
than the map distance [22]. While ensuring accurate marker order, 
HighMap greatly curbs map distance expansion. First, map 
distances that HighMap offered was robust to increasing density of 
markers. It led to a genetic distance of 2,030 cM when marker 
numbers reached 700, whereas JoinMap4.1 led to the genetic 



distance of 48,056 cM (Figure 2G, 2H and 21). In addition, the 
map distance estimated by HighMap was insensitive to the 
increase of genotyping errors. It produced only 430 cM of genetic 
distance when the data with 10% marker errors was used; whereas 
JoinMap4. 1 gave a genetic distance up to 1,925 cM (Figure 3). 
These results not only demonstrate that the error correction is an 
efficient strategy for solving map distance expansion but also 
account for the reason of better data utilization of HighMap 
relative to JoinMap. 

Computational efficiency of map construction 

Computational efficiency is a concern in linkage mapping. Both 
grouping and ordering are important in the construction of high- 
density linkage map. JoinMap stiU suffers from the limit of marker 
number in the linkage grouping [28], and this might hamper its 
application in linkage mapping based on NGS data. HighMap 
allows to handle hundreds of thousands of markers in relatively 
short time, for it was developed to construct linkage maps based on 
NGS data under the Linux operation system, and can also run on 
a server or cluster. We assessed the efficiency of marker ordering, 
by comparing it with currendy available mapping tools including 
JionMap4.0 and JionMap4.1, both of which used the default 
parameters. To this end, we simulated a data set based on a fuU-sib 
family consisting of 200 offsprings. The data sets contained 
markers numbering from 50 to 1,000. The simulation data 
contained neither missing nor erroneous values. Results showed 
that computational time consumption increased as marker 
densities went up (Figure S3). For JoinMap4.0, the time 
consumption was prohibitively large for the construction of a 
linkage map with more than 200 markers. HighMap could create 
a linkage group with 1,000 markers within a single day. We also 
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evaluated the efficiency of marker ordering of OneMap [25] and 
FsLinkageMap [26]. FsLinkageMap [26] couldn't construct a 
linkage group with 1 00 makers; OneMap (using Record, rcd,or ug 
algorithms) [25] cost more than one day for creating a linkage 
group with 200 markers (Data not shown). Therefore, FsLinka- 
geMap and OneMap might not be suitable for constructing high- 
density linkage maps based on NGS data. All experiments were 
completed on a computer with a Xeon processor (2.4 GHz and 
16 Gb memory). 

Application in real population NGS data from common 
carp 

To test HighMap performances on real data, we generated a 
high-density linkage map of common carp based on a full-sib 
family NGS data. The integrated map was comprised of 10,004 
markers with an average of 1 1 fold sequencing depths. Among 
them, 19% were the data with sequencing depths of less than 5 
fold. The segregation patterns for these markers are shown in 
Table SI. Similar to the simulation results, HighMap offered a 
linkage map of higher quality than JoinMap4. 1 . Singleton rate of 
the maps created by HighMap was less than one-ninth of those of 
the maps constructed by JoinMap4. 1 (Table S2). Heat maps 
(Figure S4) and haplotype maps (Figure S5) verified the quality of 
the linkage maps that HighMap produced. The linkage maps 
created by HighMap spanned 5,908 cM in 50 linkage groups, 
closer to that reported previously [38], smaller than that Join- 
Map4.1 gave, which was 55,550 cM (Table S3). We also analyzed 
the Spearman correlation coefficient between the marker order of 
common carp and the genome sequences of zebrafish [6], a close 
relative of the common carp. The data revealed that, for about 
70% of linkage groups of common carp, the correlation coefficient 
based on HighMap was larger than those based on JoinMap4. 1 
(Figure S6), suggesting that HighMap is better than JornMap4. 1 - 
with respect to map accuracy. 

Discussion 

In this study, we intended to develop a method that can 
efficiendy utilize NGS data and ease the construction of high- 
density and high-quality linkage map. The challenges of such an 
effort are associated with the marker density explosion and 
potential genotyping errors, which involve sequencing depth and 
sequence heterozygosity. The higher the heterozygosity is, the 
more the genotyping is prone to error. As was shown in the 
simulation study, the error rate reached up to 34. 1 % for markers 
with abxcd segregation pattern when markers were sequenced at 
one fold depth, for markers with efxeg segregation pattern it 
arrived at 21.3%, and for markers with hkxhk or nnxnp or ImxU 
segregation pattern, the error rate stood at 17.4% (Table 2). To 
address the challenges in the construction of linkage map from 
high-throughput population NGS data, we exploited an iterative 
ordering and error correction strategy as well as optimized GSS 
algorithm. Consequently, HighMap was efficient for constructing 
high-density linkage maps, even using low-depth sequencing data 
where genotyping errors and missing observations were common. 
HighMap offers many advantages over JoinMap4. 1. First, the 
marker order and the map distance are relatively accurate for data 
with large proportion of missing and erroneous markers. Second, 
It is robust to genotyping errors, allowing for the use of genotyping 
data with relatively low sequencing depth and therefore makes it 
possible to construct high-density linkage map at low cost. The 
above advantages demonstrate that the iterative marker ordering 
and error correction strategy is effective and efficient. In addition, 
HighMap provides an intuitive and convenient way to evaluate 
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Figure 3. Changes in linloge map quality as genotyping error increased. The X-axis represents genotyping error. The Y-axis represents 
Spearman ranl< correlation coefficient between estimated map marl<er order and true marl<er location for A, B and C, singleton rates for D, E and F, 
estimated genetic map distances for G, H and i, respectively."lntegrated", "Female", and "Male" indicates integrated, female, or male linkage maps, 
respectively. JoinMap4.0 failed to construct linkage map due to its inefficiency in estimating linkage phases when the error rate exceeded about 14%. 
doi:1 0.1 371 /journal.pone.0098855.g003 



map quality in forms of heat maps and haplotype maps. It also has 
the feature of easy use and does not require the specialized skills. 

Linkage maps are widely used in marker-assisted selection, 
quantitative trait loci mapping, and comparative genome analysis. 
They are also necessary to anchor scaffolds on chromosomes 
during genome assembly. Due to the limitation of marker density 
and population size of linkage maps, there left many scaffolds 
unanchored or unordered in genome assembly recently pubhshed. 
For example, the cacao genome had only 67% and 50% of 



assembled sequences anchored and ordered, respectively [39]. The 
apple (88%, 66%) [40] and grape genomes (69%, 61%) [41] 
exhibited slightly higher utility of scaffolds. Nonetheless, there was 
still more than 30% of the scaffold that could not be ordered onto 
chromosomes. Recentiy, Hyten et al showed that they can orient 
additional 23 scaffolds (totaling 7.1 Mb) [42], which were 
previously unordered, into chromosomes by using a higher-density 
linkage map with larger size of the population, suggesting 
construction of higher resolution genetic maps is critical for 



Table 2. Genotyping error and missing rates of different segregation patterns in NGS. 
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improving genome assembly. In the case of cacao, the assembled 
genome was 326 Mb, and scaflbld N90 was 75.5 kb [39]. To 
ensure 90% of the sequence assembly ordered, every 75.5 kb 
sequence requires at least two markers, and the linkage map 
should offer at least 8,636 markers in total. The linkage map used 
in the study had only 1,259 markers [39], about 7,000 markers 
fewer than what was needed. Therefore there remained up to 
2,207 scaffolds unordered in the cacao study. By providing high- 
density, HighMap will be of great benefit to genome assembly and 
validation of the scaffold placement on the chromosomes. 

In summary, we offer a computationally efficient method for 
linkage mapping using population NGS data. The development of 
HighMap should propel the application of NGS in linkage 
mapping. It is a lasting task to make full use of NGS data at lower 
cost and to construct high-density linkage maps. Great efforts are 
guaranteed to further improve the potentials of NGS data 
utilization in the linkage studies. 

Supporting Information 

Figure SI Simulation data sets containing both the missing and 
erroneous markers. Missing and erroneous rates increased 
simultaneously as markers increased from 100 to 1,000. 

(TIF) 

Figure S2 Changes in linkage map quality as missing observa- 
tion increased. The X-axis indicates missing observation. The Y- 
axis indicates Spearman rank correlation coefficient between 
estimated map marker order and true marker location for A, B 
and C, singleton rates for D, E and F, estimated genetic map 
distances for G, H and I, respectively. "Integrated", "Female", 
and "Male" indicate integrated, female, or male linkage maps, 
respectively. 
(TIF) 

Figure S3 Computational spc'(;d of HighMap. Running time was 
reported as number of 100 seconds. JoinMap4.0 is computation- 
ally demanding when marker data contained more than 200 
markers. 
(TIF) 
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